knitr::opts_chunk$set(echo = TRUE)
library(httr)
library(rvest)
library(tidyverse)
library(lubridate)
library(pdftools)
library(pdfsearch)
library(kableExtra)
library(jsonlite)
library(plotly)



qkable <- function(x, height="360px") {
  x %>%
    kable(format = "html") %>%
    kable_styling(bootstrap_options = c("condensed", "responsive", "striped", "hover", "bordered"),
                  font_size = 11,
                  position = "center") %>%
    scroll_box(width = "100%",
               height = height,
               fixed_thead = list(enabled = TRUE, background = "lightgrey"))
}

In early stages of infection outbreaks, counts of confirmed cases suggest an underlying exponential growth trend.

Let’s take a closer look.
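As a rough illustration of what an exponential trend means, here is a minimal sketch that fits a log-linear model to the global counts from the first six situation reports (the same numbers entered by hand later in this post). The variable names are my own, and six points are far too few for a serious estimate; the point is only that exponential growth shows up as a straight line on the log scale.

```r
# Global confirmed counts from the first six WHO situation reports
# (Jan 21-26, 2020), as entered by hand later in this post.
early <- data.frame(
  day   = 0:5,
  count = c(282, 314, 581, 846, 1320, 2014)
)

# Exponential growth, count = a * exp(r * day), is linear on the log scale:
# log(count) = log(a) + r * day
fit <- lm(log(count) ~ day, data = early)

growth.rate   <- unname(coef(fit)["day"])  # estimated daily growth rate r
doubling.time <- log(2) / growth.rate      # days needed for the count to double

growth.rate
doubling.time
```

A positive slope with small residuals on the log scale is what we would expect if the early counts really do grow roughly exponentially.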

Web scraper to compile data from WHO (World Health Organization) daily reports.

Noticing a pattern in the report URL structure, we might first try to exploit this to quickly get the files we need.

td <- Sys.Date()
firstreportdate <- as.Date("2020-01-21")
number.of.reports <- td - firstreportdate

# generate .pdf url 
# getReportUrl <- function(date = firstreportdate) {
#   return( paste0(urlstem, 
#                  gsub("-", x = date, replacement = ""), 
#                  "-sitrep-",
#                  td  - firstreportdate + 1,
#                  "-2019-ncov.pdf"
#                  ) )
# }

# proceeding would be pretty pointless, because it turns out the urls actually change structure mid-way

But of course, this is dependent upon the World Health Organization keeping up with this structure and not making errors or deviations.

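It turns out that the naming structure changes more than once: the early reports switch from 2019-ncov to 2019--ncov and then to plain ncov (with occasional suffixes such as -cleared or -v2), and between the February 12 and February 13 reports ncov changes to covid-19 in the URL.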

In general, we shouldn’t rely on the web structure to have a neat coherent pattern behind its file naming system.
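To make the fragility concrete, here is a small sketch that guesses file names from the pattern of the first report and compares them with three real file names from the scraped list. The helper guess_name() is hypothetical and exists only for this illustration.

```r
# Three real file names taken from the scraped list of reports
actual <- c(
  "20200121-sitrep-1-2019-ncov.pdf",
  "20200128-sitrep-8-ncov-cleared.pdf",
  "20200213-sitrep-24-covid-19.pdf"
)

# Hypothetical helper: guess a file name from the pattern of the first report
guess_name <- function(date, first = as.Date("2020-01-21")) {
  n <- as.integer(date - first) + 1  # report number, counting the first day
  paste0(format(date, "%Y%m%d"), "-sitrep-", n, "-2019-ncov.pdf")
}

dates   <- as.Date(c("2020-01-21", "2020-01-28", "2020-02-13"))
guessed <- guess_name(dates)

# Only the first guess matches the real file name
data.frame(date = dates, guessed = guessed, matches = guessed == actual)
```

The date and the report number are predictable, but the trailing part of the name is not, which is why guessing URLs breaks down after the first few reports.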

Instead, let’s look at the root page and pull all the urls from it.

# get the .pdf links from webpage
rooturl <- read_html("https://www.who.int/emergencies/diseases/novel-coronavirus-2019/situation-reports") # all the .pdf links are here

situation.reports <- rooturl %>%
  html_nodes("a") %>% 
  html_attr("href") 
# %>%   filter(!grepl('.pdf'))

situation.reports <- situation.reports[grepl('.pdf', situation.reports, fixed = TRUE)]  # fixed = TRUE: match a literal ".pdf", not "any character + pdf"
situation.reports <- situation.reports[grepl('coronaviruse/situation-reports/',situation.reports)] %>% unique()  # some links are duplicates
situation.reports <- paste0("https://www.who.int", situation.reports) # full link

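We can check that the number of reports is correct and up to date. Because number.of.reports is computed as td - firstreportdate, it does not count the first day itself, so the difference below should be 1 when today’s report is already posted, and 0 when it is not up yet (for example because of time zones or a reporting lag on the WHO’s side).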

length(situation.reports) - number.of.reports
## Time difference of 1 days

We can view our URLs and check that the dates match up correctly (especially the leap day, February 29, which R’s date arithmetic handles for us).

sit.rep <- tibble(date = as.Date("2020-01-20") + 1:length(situation.reports), urls = rev(situation.reports)) 
sit.rep %>% qkable() 
date urls
2020-01-21 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200121-sitrep-1-2019-ncov.pdf?sfvrsn=20a99c10_4
2020-01-22 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200122-sitrep-2-2019-ncov.pdf?sfvrsn=4d5bcbca_2
2020-01-23 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200123-sitrep-3-2019-ncov.pdf?sfvrsn=d6d23643_8
2020-01-24 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200124-sitrep-4-2019-ncov.pdf?sfvrsn=9272d086_8
2020-01-25 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200125-sitrep-5-2019-ncov.pdf?sfvrsn=429b143d_8
2020-01-26 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200126-sitrep-6-2019--ncov.pdf?sfvrsn=beaeee0c_4
2020-01-27 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200127-sitrep-7-2019--ncov.pdf?sfvrsn=98ef79f5_2
2020-01-28 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200128-sitrep-8-ncov-cleared.pdf?sfvrsn=8b671ce5_2
2020-01-29 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200129-sitrep-9-ncov-v2.pdf?sfvrsn=e2c8915_2
2020-01-30 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200130-sitrep-10-ncov.pdf?sfvrsn=d0b2e480_2
2020-01-31 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200131-sitrep-11-ncov.pdf?sfvrsn=de7c0f7_4
2020-02-01 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200201-sitrep-12-ncov.pdf?sfvrsn=273c5d35_2
2020-02-02 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200202-sitrep-13-ncov-v3.pdf?sfvrsn=195f4010_6
2020-02-03 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200203-sitrep-14-ncov.pdf?sfvrsn=f7347413_4
2020-02-04 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200204-sitrep-15-ncov.pdf?sfvrsn=88fe8ad6_4
2020-02-05 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200205-sitrep-16-ncov.pdf?sfvrsn=23af287f_4
2020-02-06 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200206-sitrep-17-ncov.pdf?sfvrsn=17f0dca_4
2020-02-07 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200207-sitrep-18-ncov.pdf?sfvrsn=fa644293_2
2020-02-08 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200208-sitrep-19-ncov.pdf?sfvrsn=6e091ce6_4
2020-02-09 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200209-sitrep-20-ncov.pdf?sfvrsn=6f80d1b9_4
2020-02-10 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200210-sitrep-21-ncov.pdf?sfvrsn=947679ef_2
2020-02-11 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200211-sitrep-22-ncov.pdf?sfvrsn=fb6d49b1_2
2020-02-12 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200212-sitrep-23-ncov.pdf?sfvrsn=41e9fb78_4
2020-02-13 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200213-sitrep-24-covid-19.pdf?sfvrsn=9a7406a4_4
2020-02-14 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200214-sitrep-25-covid-19.pdf?sfvrsn=61dda7d_2
2020-02-15 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200215-sitrep-26-covid-19.pdf?sfvrsn=a4cc6787_2
2020-02-16 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200216-sitrep-27-covid-19.pdf?sfvrsn=78c0eb78_4
2020-02-17 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200217-sitrep-28-covid-19.pdf?sfvrsn=a19cf2ad_2
2020-02-18 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200218-sitrep-29-covid-19.pdf?sfvrsn=6262de9e_2
2020-02-19 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200219-sitrep-30-covid-19.pdf?sfvrsn=3346b04f_2
2020-02-20 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200220-sitrep-31-covid-19.pdf?sfvrsn=dfd11d24_2
2020-02-21 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200221-sitrep-32-covid-19.pdf?sfvrsn=4802d089_2
2020-02-22 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200222-sitrep-33-covid-19.pdf?sfvrsn=c9585c8f_4
2020-02-23 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200223-sitrep-34-covid-19.pdf?sfvrsn=44ff8fd3_2
2020-02-24 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200224-sitrep-35-covid-19.pdf?sfvrsn=1ac4218d_2
2020-02-25 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200225-sitrep-36-covid-19.pdf?sfvrsn=2791b4e0_2
2020-02-26 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200226-sitrep-37-covid-19.pdf?sfvrsn=2146841e_2
2020-02-27 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200227-sitrep-38-covid-19.pdf?sfvrsn=2db7a09b_4
2020-02-28 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200228-sitrep-39-covid-19.pdf?sfvrsn=5bbf3e7d_4
2020-02-29 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200229-sitrep-40-covid-19.pdf?sfvrsn=849d0665_2
2020-03-01 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200301-sitrep-41-covid-19.pdf?sfvrsn=6768306d_2
2020-03-02 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200302-sitrep-42-covid-19.pdf?sfvrsn=224c1add_2
2020-03-03 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200303-sitrep-43-covid-19.pdf?sfvrsn=76e425ed_2
2020-03-04 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200304-sitrep-44-covid-19.pdf?sfvrsn=93937f92_6
2020-03-05 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200305-sitrep-45-covid-19.pdf?sfvrsn=ed2ba78b_4
2020-03-06 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200306-sitrep-46-covid-19.pdf?sfvrsn=96b04adf_4
2020-03-07 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200307-sitrep-47-covid-19.pdf?sfvrsn=27c364a4_4
2020-03-08 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200308-sitrep-48-covid-19.pdf?sfvrsn=16f7ccef_4
2020-03-09 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200309-sitrep-49-covid-19.pdf?sfvrsn=70dabe61_4
2020-03-10 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200310-sitrep-50-covid-19.pdf?sfvrsn=55e904fb_2
2020-03-11 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200311-sitrep-51-covid-19.pdf?sfvrsn=1ba62e57_10
2020-03-12 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200312-sitrep-52-covid-19.pdf?sfvrsn=e2bfc9c0_4
2020-03-13 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200313-sitrep-53-covid-19.pdf?sfvrsn=adb3f72_2
2020-03-14 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200314-sitrep-54-covid-19.pdf?sfvrsn=dcd46351_8
2020-03-15 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200315-sitrep-55-covid-19.pdf?sfvrsn=33daa5cb_8
2020-03-16 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200316-sitrep-56-covid-19.pdf?sfvrsn=9fda7db2_6
2020-03-17 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200317-sitrep-57-covid-19.pdf?sfvrsn=a26922f2_4
2020-03-18 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200318-sitrep-58-covid-19.pdf?sfvrsn=20876712_2
2020-03-19 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200319-sitrep-59-covid-19.pdf?sfvrsn=c3dcdef9_2
2020-03-20 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200320-sitrep-60-covid-19.pdf?sfvrsn=d2bb4f1f_2
2020-03-21 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200321-sitrep-61-covid-19.pdf?sfvrsn=ce5ca11c_2
2020-03-22 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200322-sitrep-62-covid-19.pdf?sfvrsn=755c76cd_2
2020-03-23 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200323-sitrep-63-covid-19.pdf?sfvrsn=b617302d_4
2020-03-24 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200324-sitrep-64-covid-19.pdf?sfvrsn=723b221e_2
2020-03-25 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200325-sitrep-65-covid-19.pdf?sfvrsn=ce13061b_2
2020-03-26 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200326-sitrep-66-covid-19.pdf?sfvrsn=9e5b8b48_2
2020-03-27 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200327-sitrep-67-covid-19.pdf?sfvrsn=b65f68eb_4
2020-03-28 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200328-sitrep-68-covid-19.pdf?sfvrsn=384bc74c_4
2020-03-29 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200329-sitrep-69-covid-19.pdf?sfvrsn=8d6620fa_8
2020-03-30 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200330-sitrep-70-covid-19.pdf?sfvrsn=7e0fe3f8_2
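The date column above is assigned purely by position in the list. As a complementary check, here is a minimal sketch that parses the date and the report number out of each file name and confirms that they agree. It uses three URLs copied from the table; the regular expression is my own assumption about the naming scheme and may need adjusting if the WHO changes it again.

```r
# Three URLs copied from the table above
urls <- c(
  "https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200121-sitrep-1-2019-ncov.pdf?sfvrsn=20a99c10_4",
  "https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200229-sitrep-40-covid-19.pdf?sfvrsn=849d0665_2",
  "https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200330-sitrep-70-covid-19.pdf?sfvrsn=7e0fe3f8_2"
)

# Capture the 8-digit date and the report number from the file name
pattern   <- ".*/([0-9]{8})-sitrep-([0-9]+)-.*"
url.date  <- as.Date(sub(pattern, "\\1", urls), format = "%Y%m%d")
report.no <- as.integer(sub(pattern, "\\2", urls))

# Report n should fall n - 1 days after the first report
expected.no <- as.integer(url.date - as.Date("2020-01-21")) + 1

data.frame(url.date, report.no, consistent = report.no == expected.no)
```

Running the same check over the full list would flag any skipped or duplicated report, which positional dating silently hides.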

Now we have the list of PDF URLs to parse through and clean up. Let’s first download the files locally so that we can process them offline, which is also faster.

# download for offline processing
if (!dir.exists("covid-19")) {
  dir.create("covid-19")
}

# build paths relative to the project root instead of calling setwd(),
# so the chunk behaves the same no matter how often it is run
for (i in seq_len(nrow(sit.rep))) {
  filename <- file.path("covid-19", paste0(sit.rep$date[i], ".pdf"))
  if (!file.exists(filename)) {
    download.file(url = sit.rep$urls[i],
                  destfile = filename,
                  mode = "wb")
  }
}
pdf.paths <- paste0(sit.rep$date, ".pdf")

counts <- sit.rep %>% select(- "urls") %>% mutate(
              global = 0L,
              china = 0L,
              outside_china = 0L,
              # global_new = 0L,
              # china_new = 0L,
              global_deaths = 0L,
              china_deaths = 0L,
              outside_china_deaths = 0L
)


# REPORT TYPE A: Older, (actually the web scrape reveals this form was reused from the Zika virus)
# treat 1 ~ 6 MANUALLY; not worth the investment to automate this because this is an old format

editrow <- function(df, n = 0, global, china, outsidechina) {
  # overwrite row n only if n is a valid row index
  if (n >= 1 && n <= nrow(df)) {
    df$global[n] <- global
    df$china[n] <- china
    df$outside_china[n] <- outsidechina
  }
  return(df)
}

# manually do first report's data
counts <- counts %>% editrow(n = 1, 
                             global = 282, 
                             china = 278, 
                             outsidechina = 3)
# "As of 20 January 2020, 282 confirmed cases of 2019-nCoV have been reported from four countries including China (278 cases), Thailand (2 cases), Japan (1 case) and the Republic of Korea (1 case);"


# report 2 : jan 22
counts <- counts %>% editrow(n = 2,
                             global = 314,
                             china = 309,
                             outsidechina = 4) # mismatch

# report 3: jan 23
counts <- counts %>% editrow(n = 3,
                             global = 581,
                             china = 571,
                             outsidechina = 10)

# report 4 : jan 24
counts <- counts %>% editrow(n = 4,
                             global = 846,
                             china = 830,
                             outsidechina = 11) # doesnt add up

# report 5 : jan 25
counts <- counts %>% editrow(n = 5,
                             global = 1320,
                             china = 1297,
                             outsidechina = 23)

# report 6 : jan 26
counts <- counts %>% editrow(n = 6,
                             global = 2014,
                             china = 1985,
                             outsidechina = 29)


# counts %>% qkable()

Data Cleaning

# head(result$line_text)
dict <- c(" confirmed ", " deaths ") # keywords, padded with spaces exactly as getValues() expects them

Let’s write some functions to automate the data cleaning.

getValues <- function(n, pattern) {
  # pattern : key from dict, either " confirmed " or " deaths " (including the surrounding spaces)
  # n : a particular situation report number (this function only supports 7+; the first six were filled in manually)

  parsedsitrep <- pdf_text(pdf = paste0("covid-19/", pdf.paths[n]))

  result.n <- keyword_search(parsedsitrep[[1]], # care about first page only
                           keyword = pattern,
                           path = FALSE,
                           surround_lines = 0)

  txt <- result.n$line_text %>% str_split(pattern = "Globally", simplify = FALSE)
  txt <- txt %>% str_split(pattern = " China ", simplify = FALSE)
  txt <- txt %>% str_split(pattern = "•", simplify = FALSE, n = 2)


  bin <- c()
  
  for (i in 1:length(txt)) {
    for (j in 1:length(txt[[i]])) {
          mess <- txt[[i]][j]
          bin <- c(bin, substr(x = mess, 
                           start = regexpr(pattern = pattern, text = mess) - ifelse(pattern == " deaths ", 5, 8), # don't expect deaths to exceed 99,999 yet, so use a narrower window to minimize the chance of picking up extraneous numbers
                           stop = regexpr(pattern = pattern, text = mess) + nchar(pattern) - 1))
    }
  }

  if ( pattern == " deaths " ) {
    count.vec <- bin %>%
      gsub(pattern = "\\.", replacement = "") %>%
      str_split(pattern = " ") %>%
      keep_numeric()
    count.vec <- count.vec[!is.na(count.vec)] %>% tail(3)
  } else if (pattern == " confirmed ") {
    count.vec <- bin %>%
      gsub(pattern = "\\.", replacement = "") %>%
      str_split(pattern = " ") %>%
      extract_numeric() # deprecated, but works better here than parse_number()
    count.vec <- count.vec[!is.na(count.vec)] %>% tail(3)
  }
  
  return(count.vec)
}

keep_numeric <- function( obj ) {
  keep <- c()
  for (i in 1:length(obj)) {
    for (j in 1:length(obj[[i]])) {
      tmp <- try(parse_number(obj[[i]][j]))
      if (is.numeric(tmp)) {
        keep <- c(keep, tmp)
      }
    }
  }
  return(keep)
}
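The fixed-width substr() windows above are sensitive to how many digits a count has. As an alternative sketch (not the approach used in this post), a regular expression can grab the number directly in front of a keyword. The input lines below are a simplified stand-in for what keyword_search() might return, not verbatim WHO wording, and number_before() is a hypothetical helper; the numbers are the March 16 counts.

```r
# Simplified stand-in for extracted sidebar lines (illustrative wording);
# the numbers are the March 16 counts.
sidebar <- c("Globally 167515 confirmed",
             "China 81077 confirmed",
             "China 3218 deaths",
             "Outside of China 86438 confirmed",
             "Outside of China 3388 deaths")

# Hypothetical helper: return the number directly in front of a keyword
number_before <- function(lines, keyword) {
  pattern <- paste0("[0-9]+[[:space:]]+", keyword)
  hits <- regmatches(lines, regexpr(pattern, lines))  # lines without a match are dropped
  as.numeric(sub("[[:space:]].*$", "", hits))         # keep only the leading digits
}

number_before(sidebar, "confirmed")
number_before(sidebar, "deaths")
```

Real report text is messier than this (thousands separators, line breaks inside a number), so treat this as a starting point rather than a drop-in replacement for getValues().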

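Here’s a working example for report 56 (March 16): it returns 167,515 confirmed cases globally, 81,077 in China and 86,438 outside of China, along with 3,218 deaths in China and 3,388 deaths outside of China.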

getValues( n = 56, pattern = " confirmed " )
## [1] 167515  81077  86438
getValues( n = 56, pattern = " deaths " )
## [1] 3218 3388
# troubleshooting & debugging for 3rd confirmed
n <- 28
pattern <- " confirmed "
bin <- c()

parsedsitrep <- pdf_text(pdf = paste0("covid-19/", pdf.paths[n]))

result.n <- keyword_search(parsedsitrep[[1]], # only first page
                           keyword = pattern,
                           path = FALSE,
                           surround_lines = 0)

txt <- result.n$line_text %>% str_split(pattern = "Globally", simplify = FALSE)
txt <- txt %>% str_split(pattern = "laboratory", simplify = FALSE)
txt <- txt %>% str_split(pattern = " China ", simplify = FALSE)


for (i in 1:length(txt)) {
  for (j in 1:length(txt[[i]])) {
        mess <- txt[[i]][j]
        bin <- c(bin, substr(x = mess,
                         start = regexpr(pattern = pattern, text = mess) - ifelse(pattern == " deaths ", 5, 10), # don't expect deaths to exceed 99,999 yet, so use a narrower window to minimize the chance of picking up extraneous numbers
                         stop = regexpr(pattern = pattern, text = mess) + nchar(pattern)))
  }
}

if ( pattern == " deaths " ) {
  count.vec <- bin %>%
    gsub(pattern = "\\.", replacement = "") %>%
    str_split(pattern = " ") %>%
    keep_numeric()
} else if (pattern == " confirmed " ) {
  count.vec <- bin %>%
    gsub(pattern = "\\.", replacement = "") %>%
    str_split(pattern = " ") %>%
    extract_numeric() # deprecated, but works better here than parse_number()
}

count.vec <- count.vec[!is.na(count.vec)]
count.vec

# txt
# 

for ( i in 7:(nrow(counts)) ) {
  print(paste("Now parsing", counts$date[i]))
  
  current.confirmed <- getValues(i, " confirmed ")
  current.deaths <-  getValues(i, " deaths ")

  # CONFIRMED CASES - reports always have 3 instances of "confirmed" metrics
  instances.confirmed <- sum( !is.na( current.confirmed ))
  if ( instances.confirmed != 3) {
    warning('Cannot find 3 instances of \'confirmed\' metrics in report.')
  } else {
    counts$global[i] <- current.confirmed[1]
    counts$china[i] <- current.confirmed[2]
    counts$outside_china[i] <- current.confirmed[3]
  }
  
  # DEATHS - 
  instances.deaths <- sum( !is.na( current.deaths ))
  if ( instances.deaths == 1 ) { # only have china deaths
    counts$china_deaths[i] <- current.deaths[1]
  } else if ( instances.deaths == 2 ) {
    counts$china_deaths[i] <- current.deaths[1]
    counts$outside_china_deaths[i] <- current.deaths[2]
  } else if ( instances.deaths == 3 ) {
    counts$global_deaths[i] <- current.deaths[1]
    counts$china_deaths[i] <- current.deaths[2]
    counts$outside_china_deaths[i] <- current.deaths[3]
  }
  
}
## [1] "Now parsing 2020-01-27"
## [1] "Now parsing 2020-01-28"
## [1] "Now parsing 2020-01-29"
## [1] "Now parsing 2020-01-30"
## [1] "Now parsing 2020-01-31"
## [1] "Now parsing 2020-02-01"
## [1] "Now parsing 2020-02-02"
## [1] "Now parsing 2020-02-03"
## [1] "Now parsing 2020-02-04"
## [1] "Now parsing 2020-02-05"
## [1] "Now parsing 2020-02-06"
## [1] "Now parsing 2020-02-07"
## [1] "Now parsing 2020-02-08"
## [1] "Now parsing 2020-02-09"
## [1] "Now parsing 2020-02-10"
## [1] "Now parsing 2020-02-11"
## [1] "Now parsing 2020-02-12"
## [1] "Now parsing 2020-02-13"
## [1] "Now parsing 2020-02-14"
## [1] "Now parsing 2020-02-15"
## [1] "Now parsing 2020-02-16"
## [1] "Now parsing 2020-02-17"
## [1] "Now parsing 2020-02-18"
## [1] "Now parsing 2020-02-19"
## [1] "Now parsing 2020-02-20"
## [1] "Now parsing 2020-02-21"
## [1] "Now parsing 2020-02-22"
## [1] "Now parsing 2020-02-23"
## [1] "Now parsing 2020-02-24"
## [1] "Now parsing 2020-02-25"
## [1] "Now parsing 2020-02-26"
## [1] "Now parsing 2020-02-27"
## [1] "Now parsing 2020-02-28"
## [1] "Now parsing 2020-02-29"
## [1] "Now parsing 2020-03-01"
## [1] "Now parsing 2020-03-02"
## [1] "Now parsing 2020-03-03"
## [1] "Now parsing 2020-03-04"
## [1] "Now parsing 2020-03-05"
## [1] "Now parsing 2020-03-06"
## [1] "Now parsing 2020-03-07"
## [1] "Now parsing 2020-03-08"
## [1] "Now parsing 2020-03-09"
## [1] "Now parsing 2020-03-10"
## [1] "Now parsing 2020-03-11"
## [1] "Now parsing 2020-03-12"
## [1] "Now parsing 2020-03-13"
## [1] "Now parsing 2020-03-14"
## [1] "Now parsing 2020-03-15"
## [1] "Now parsing 2020-03-16"
## [1] "Now parsing 2020-03-17"
## [1] "Now parsing 2020-03-18"
## [1] "Now parsing 2020-03-19"
## [1] "Now parsing 2020-03-20"
## [1] "Now parsing 2020-03-21"
## [1] "Now parsing 2020-03-22"
## [1] "Now parsing 2020-03-23"
## [1] "Now parsing 2020-03-24"
## [1] "Now parsing 2020-03-25"
## [1] "Now parsing 2020-03-26"
## [1] "Now parsing 2020-03-27"
## [1] "Now parsing 2020-03-28"
## [1] "Now parsing 2020-03-29"
## [1] "Now parsing 2020-03-30"
counts %>% qkable()
date global china outside_china global_deaths china_deaths outside_china_deaths
2020-01-21 282 278 3 0 0 0
2020-01-22 314 309 4 0 0 0
2020-01-23 581 571 10 0 0 0
2020-01-24 846 830 11 0 0 0
2020-01-25 1320 1297 23 0 0 0
2020-01-26 2014 1985 29 0 0 0
2020-01-27 2798 2741 37 0 80 0
2020-01-28 0 0 0 0 106 0
2020-01-29 6065 5997 68 0 132 0
2020-01-30 7736 29 82 0 170 0
2020-01-31 0 0 0 0 213 0
2020-02-01 11953 11821 132 0 259 0
2020-02-02 14557 14411 146 0 304 0
2020-02-03 0 0 0 0 361 0
2020-02-04 20630 20471 159 0 425 0
2020-02-05 24554 24363 191 0 491 0
2020-02-06 28276 28060 216 0 564 0
2020-02-07 31481 31211 270 0 637 0
2020-02-08 0 0 0 0 723 0
2020-02-09 37558 37251 307 0 812 0
2020-02-10 40554 40235 319 0 909 0
2020-02-11 43103 42708 395 0 1017 0
2020-02-12 45171 44730 441 0 1114 0
2020-02-13 46997 46550 447 0 1368 0
2020-02-14 0 0 0 0 1381 0
2020-02-15 0 0 0 0 1524 2
2020-02-16 0 0 0 0 1666 0
2020-02-17 0 0 0 0 1772 3
2020-02-18 73332 72528 804 0 1870 3
2020-02-19 0 0 0 0 2006 3
2020-02-20 75748 74675 1073 0 2121 8
2020-02-21 76769 75569 1200 0 2239 8
2020-02-22 77794 76392 1402 0 2348 11
2020-02-23 0 0 0 0 2445 0
2020-02-24 79331 77262 2069 0 2595 23
2020-02-25 80239 77780 2459 0 2666 34
2020-02-26 81109 78191 2918 0 2718 44
2020-02-27 82294 78630 3664 0 2747 57
2020-02-28 83652 78961 4691 0 2791 67
2020-02-29 85403 79394 6009 0 2838 86
2020-03-01 87137 79968 7169 0 2873 104
2020-03-02 88948 80174 8774 0 2915 128
2020-03-03 90869 80304 10565 0 2946 166
2020-03-04 93091 80422 12669 0 2984 214
2020-03-05 95324 80565 14759 0 3015 266
2020-03-06 0 0 0 0 3045 0
2020-03-07 101927 80813 21110 0 3073 413
2020-03-08 105586 80859 24727 0 3100 484
2020-03-09 109577 80904 28673 3809 3123 686
2020-03-10 113702 80924 32778 4012 3140 872
2020-03-11 118319 80955 37364 4292 3162 1130
2020-03-12 125260 80981 44279 4613 3173 1440
2020-03-13 132758 80991 51767 4955 3180 1775
2020-03-14 81021 61513 12 5392 3194 2198
2020-03-15 153517 81048 72469 5735 3204 2531
2020-03-16 167515 81077 86438 0 3218 3388
2020-03-17 16786 4910 228 68 42 4
2020-03-18 191127 74760 233 3357 9 68
2020-03-19 87108 657 19518 4084 23 1161
2020-03-20 20759 13271 473 1312 178 8
2020-03-21 22355 18877 572 5999 38 235
2020-03-22 94787 151293 1257 3438 7425 45
2020-03-23 1776 25375 37016 58 1741 465
2020-03-24 1990 27215 49444 65 1877 565
2020-03-25 2344 29631 60834 72 2008 813
2020-03-26 32442 75712 1937 2162 1065 31
2020-03-27 35249 81137 2419 105 2336 39
2020-03-28 324343 3085 38931 740 114 2508
2020-03-29 3709 42777 120792 139 2668 1973
2020-03-30 693224 103775 392757 3649 3 962

Most entries are correct, but we still need a considerable amount of cleaning. Part of the reason is that the WHO used the term "laboratory-confirmed" instead of "confirmed" in a few reports; more generally, scraping numbers from PDF sidebars like these is difficult.
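One cheap way to find the rows that need attention is an arithmetic check: in a correctly parsed row, the global count should equal the China count plus the count outside China. Here is a minimal sketch on four rows copied from the scraped table above; the data frame and column name suspect are my own.

```r
# Four rows copied from the scraped table above
check <- data.frame(
  date          = as.Date(c("2020-01-29", "2020-01-30", "2020-03-14", "2020-03-15")),
  global        = c(6065, 7736, 81021, 153517),
  china         = c(5997, 29, 61513, 81048),
  outside_china = c(68, 82, 12, 72469)
)

# A correctly parsed row should satisfy global = china + outside_china
check$suspect <- check$global != check$china + check$outside_china

check[check$suspect, ]
```

A flagged row is only a hint. A few of the early reports do not add up in the source either, so each flagged row still needs a look at the original PDF.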

Then, starting March 17, the reports stopped publishing the Global, China and Outside of China breakdown and grouped the numbers differently, so from that date on we only keep the global totals.

At this point it really is faster to work in Excel.

extract_MMDD <- function(input_date = Sys.Date()) {
  # input_date : a Date or a "YYYY-MM-DD" string, e.g. "2020-03-25" gives "03-25"
  return(format(as.Date(input_date), "%m-%d"))
}

# dated file inside the covid-19 folder, e.g. covid-19/03-25.csv
csvfilename <- file.path("covid-19", paste0(extract_MMDD(), ".csv"))

update_csv <- function(path_to_file = csvfilename) {
  # write the scraped counts once per day; never overwrite an existing file
  if (!file.exists(path_to_file)) {
    write_excel_csv(counts, path_to_file)
  } else {
    print(paste(".csv for", td, "already exists."))
  }
}

update_csv()
covid <- read_csv(file.path("covid-19", "scrape-cleaned.csv")) # read_csv() already returns a tibble
# covid <- head(scrape_cleaned, nrow(scrape_cleaned) - 2) # handle this with NA's instead 
covid$date <- mdy("1-20-2020") + 1:nrow(covid) # fix dates from csv

covid %>% qkable()
date global china outside_china global_deaths china_deaths outside_china_deaths
2020-01-21 282 278 3 0 0 0
2020-01-22 314 309 4 0 0 0
2020-01-23 581 571 10 0 0 0
2020-01-24 846 830 11 0 0 0
2020-01-25 1320 1297 23 0 0 0
2020-01-26 2014 1985 29 0 0 0
2020-01-27 2798 2741 37 80 80 0
2020-01-28 4593 4537 56 106 106 0
2020-01-29 6065 5997 68 132 132 0
2020-01-30 7736 7736 82 170 170 0
2020-01-31 9826 9720 106 213 213 0
2020-02-01 11953 11821 132 259 259 0
2020-02-02 14557 14411 146 304 304 0
2020-02-03 17391 17238 153 361 361 0
2020-02-04 20630 20471 159 425 425 0
2020-02-05 24554 24363 191 491 491 0
2020-02-06 28276 28060 216 564 564 0
2020-02-07 31481 31211 270 637 637 0
2020-02-08 34886 34598 288 723 723 0
2020-02-09 37558 37251 307 812 812 0
2020-02-10 40554 40235 319 909 909 0
2020-02-11 43103 42708 395 1017 1017 0
2020-02-12 45171 44730 441 1114 1114 0
2020-02-13 46997 46550 447 1368 1368 0
2020-02-14 49053 48548 505 1383 1381 2
2020-02-15 50580 50054 526 1526 1524 2
2020-02-16 51857 51174 683 1669 1666 3
2020-02-17 71429 70635 794 1775 1772 3
2020-02-18 73332 72528 804 1873 1870 3
2020-02-19 75204 74280 924 2009 2006 3
2020-02-20 75748 74675 1073 2129 2121 8
2020-02-21 76769 75569 1200 2247 2239 8
2020-02-22 77794 76392 1402 2359 2348 11
2020-02-23 78811 77042 1769 2462 2445 17
2020-02-24 79331 77262 2069 2618 2595 23
2020-02-25 80239 77780 2459 2700 2666 34
2020-02-26 81109 78191 2918 2762 2718 44
2020-02-27 82294 78630 3664 2804 2747 57
2020-02-28 83652 78961 4691 2858 2791 67
2020-02-29 85403 79394 6009 2924 2838 86
2020-03-01 87137 79968 7169 2977 2873 104
2020-03-02 88948 80174 8774 3043 2915 128
2020-03-03 90869 80304 10565 3112 2946 166
2020-03-04 93091 80422 12669 3198 2984 214
2020-03-05 95324 80565 14759 3281 3015 266
2020-03-06 98192 80711 17481 3380 3045 335
2020-03-07 101927 80813 21110 3486 3073 413
2020-03-08 105586 80859 24727 3584 3100 484
2020-03-09 109577 80904 28673 3809 3123 686
2020-03-10 113702 80924 32778 4012 3140 872
2020-03-11 118319 80955 37364 4292 3162 1130
2020-03-12 125260 80981 44279 4613 3173 1440
2020-03-13 132758 80991 51767 4955 3180 1775
2020-03-14 142534 81021 61513 5392 3194 2198
2020-03-15 153517 81048 72469 5735 3204 2531
2020-03-16 167515 81077 86438 6606 3218 3388
2020-03-17 179111 0 0 7426 0 0
2020-03-18 191127 0 0 7807 0 0
2020-03-19 209839 0 0 8778 0 0
2020-03-20 234073 0 0 9840 0 0
2020-03-21 266073 0 0 11183 0 0
2020-03-22 292142 0 0 12783 0 0
2020-03-23 332930 0 0 14509 0 0
2020-03-24 372755 0 0 16231 0 0
2020-03-25 414179 0 0 18440 3287 0

Data Prep

covid.confirmed <- covid %>% 
  select(c('date', 'global', 'china', 'outside_china')) %>% 
  pivot_longer( cols = - c('date'), names_to = 'region', values_to = 'count' )

covid.deaths <- covid %>%
  select(c('date', 'global_deaths', 'china_deaths', 'outside_china_deaths')) %>%
  pivot_longer( cols = - c('date'), names_to = 'region', values_to = 'count')

# set factor levels for plotting order
covid.confirmed$region <- factor(covid.confirmed$region, levels = c("global", "china", "outside_china"))
covid.deaths$region <- factor(covid.deaths$region, levels = c("global_deaths", "china_deaths", "outside_china_deaths"))


# change 0's in counts to NA to not be plotted (starting 3-17-2020, we only use global data)
covid.confirmed$count <- covid.confirmed$count %>% na_if(y = 0)
covid.deaths$count <- covid.deaths$count %>% na_if(y = 0)


covid.confirmed %>% qkable()
date region count
2020-01-21 global 282
2020-01-21 china 278
2020-01-21 outside_china 3
2020-01-22 global 314
2020-01-22 china 309
2020-01-22 outside_china 4
2020-01-23 global 581
2020-01-23 china 571
2020-01-23 outside_china 10
2020-01-24 global 846
2020-01-24 china 830
2020-01-24 outside_china 11
2020-01-25 global 1320
2020-01-25 china 1297
2020-01-25 outside_china 23
2020-01-26 global 2014
2020-01-26 china 1985
2020-01-26 outside_china 29
2020-01-27 global 2798
2020-01-27 china 2741
2020-01-27 outside_china 37
2020-01-28 global 4593
2020-01-28 china 4537
2020-01-28 outside_china 56
2020-01-29 global 6065
2020-01-29 china 5997
2020-01-29 outside_china 68
2020-01-30 global 7736
2020-01-30 china 7736
2020-01-30 outside_china 82
2020-01-31 global 9826
2020-01-31 china 9720
2020-01-31 outside_china 106
2020-02-01 global 11953
2020-02-01 china 11821
2020-02-01 outside_china 132
2020-02-02 global 14557
2020-02-02 china 14411
2020-02-02 outside_china 146
2020-02-03 global 17391
2020-02-03 china 17238
2020-02-03 outside_china 153
2020-02-04 global 20630
2020-02-04 china 20471
2020-02-04 outside_china 159
2020-02-05 global 24554
2020-02-05 china 24363
2020-02-05 outside_china 191
2020-02-06 global 28276
2020-02-06 china 28060
2020-02-06 outside_china 216
2020-02-07 global 31481
2020-02-07 china 31211
2020-02-07 outside_china 270
2020-02-08 global 34886
2020-02-08 china 34598
2020-02-08 outside_china 288
2020-02-09 global 37558
2020-02-09 china 37251
2020-02-09 outside_china 307
2020-02-10 global 40554
2020-02-10 china 40235
2020-02-10 outside_china 319
2020-02-11 global 43103
2020-02-11 china 42708
2020-02-11 outside_china 395
2020-02-12 global 45171
2020-02-12 china 44730
2020-02-12 outside_china 441
2020-02-13 global 46997
2020-02-13 china 46550
2020-02-13 outside_china 447
2020-02-14 global 49053
2020-02-14 china 48548
2020-02-14 outside_china 505
2020-02-15 global 50580
2020-02-15 china 50054
2020-02-15 outside_china 526
2020-02-16 global 51857
2020-02-16 china 51174
2020-02-16 outside_china 683
2020-02-17 global 71429
2020-02-17 china 70635
2020-02-17 outside_china 794
2020-02-18 global 73332
2020-02-18 china 72528
2020-02-18 outside_china 804
2020-02-19 global 75204
2020-02-19 china 74280
2020-02-19 outside_china 924
2020-02-20 global 75748
2020-02-20 china 74675
2020-02-20 outside_china 1073
2020-02-21 global 76769
2020-02-21 china 75569
2020-02-21 outside_china 1200
2020-02-22 global 77794
2020-02-22 china 76392
2020-02-22 outside_china 1402
2020-02-23 global 78811
2020-02-23 china 77042
2020-02-23 outside_china 1769
2020-02-24 global 79331
2020-02-24 china 77262
2020-02-24 outside_china 2069
2020-02-25 global 80239
2020-02-25 china 77780
2020-02-25 outside_china 2459
2020-02-26 global 81109
2020-02-26 china 78191
2020-02-26 outside_china 2918
2020-02-27 global 82294
2020-02-27 china 78630
2020-02-27 outside_china 3664
2020-02-28 global 83652
2020-02-28 china 78961
2020-02-28 outside_china 4691
2020-02-29 global 85403
2020-02-29 china 79394
2020-02-29 outside_china 6009
2020-03-01 global 87137
2020-03-01 china 79968
2020-03-01 outside_china 7169
2020-03-02 global 88948
2020-03-02 china 80174
2020-03-02 outside_china 8774
2020-03-03 global 90869
2020-03-03 china 80304
2020-03-03 outside_china 10565
2020-03-04 global 93091
2020-03-04 china 80422
2020-03-04 outside_china 12669
2020-03-05 global 95324
2020-03-05 china 80565
2020-03-05 outside_china 14759
2020-03-06 global 98192
2020-03-06 china 80711
2020-03-06 outside_china 17481
2020-03-07 global 101927
2020-03-07 china 80813
2020-03-07 outside_china 21110
2020-03-08 global 105586
2020-03-08 china 80859
2020-03-08 outside_china 24727
2020-03-09 global 109577
2020-03-09 china 80904
2020-03-09 outside_china 28673
2020-03-10 global 113702
2020-03-10 china 80924
2020-03-10 outside_china 32778
2020-03-11 global 118319
2020-03-11 china 80955
2020-03-11 outside_china 37364
2020-03-12 global 125260
2020-03-12 china 80981
2020-03-12 outside_china 44279
2020-03-13 global 132758
2020-03-13 china 80991
2020-03-13 outside_china 51767
2020-03-14 global 142534
2020-03-14 china 81021
2020-03-14 outside_china 61513
2020-03-15 global 153517
2020-03-15 china 81048
2020-03-15 outside_china 72469
2020-03-16 global 167515
2020-03-16 china 81077
2020-03-16 outside_china 86438
2020-03-17 global 179111
2020-03-17 china NA
2020-03-17 outside_china NA
2020-03-18 global 191127
2020-03-18 china NA
2020-03-18 outside_china NA
2020-03-19 global 209839
2020-03-19 china NA
2020-03-19 outside_china NA
2020-03-20 global 234073
2020-03-20 china NA
2020-03-20 outside_china NA
2020-03-21 global 266073
2020-03-21 china NA
2020-03-21 outside_china NA
2020-03-22 global 292142
2020-03-22 china NA
2020-03-22 outside_china NA
2020-03-23 global 332930
2020-03-23 china NA
2020-03-23 outside_china NA
2020-03-24 global 372755
2020-03-24 china NA
2020-03-24 outside_china NA
2020-03-25 global 414179
2020-03-25 china NA
2020-03-25 outside_china NA
covid.deaths %>% qkable()
date region count
2020-01-21 global_deaths NA
2020-01-21 china_deaths NA
2020-01-21 outside_china_deaths NA
2020-01-22 global_deaths NA
2020-01-22 china_deaths NA
2020-01-22 outside_china_deaths NA
2020-01-23 global_deaths NA
2020-01-23 china_deaths NA
2020-01-23 outside_china_deaths NA
2020-01-24 global_deaths NA
2020-01-24 china_deaths NA
2020-01-24 outside_china_deaths NA
2020-01-25 global_deaths NA
2020-01-25 china_deaths NA
2020-01-25 outside_china_deaths NA
2020-01-26 global_deaths NA
2020-01-26 china_deaths NA
2020-01-26 outside_china_deaths NA
2020-01-27 global_deaths 80
2020-01-27 china_deaths 80
2020-01-27 outside_china_deaths NA
2020-01-28 global_deaths 106
2020-01-28 china_deaths 106
2020-01-28 outside_china_deaths NA
2020-01-29 global_deaths 132
2020-01-29 china_deaths 132
2020-01-29 outside_china_deaths NA
2020-01-30 global_deaths 170
2020-01-30 china_deaths 170
2020-01-30 outside_china_deaths NA
2020-01-31 global_deaths 213
2020-01-31 china_deaths 213
2020-01-31 outside_china_deaths NA
2020-02-01 global_deaths 259
2020-02-01 china_deaths 259
2020-02-01 outside_china_deaths NA
2020-02-02 global_deaths 304
2020-02-02 china_deaths 304
2020-02-02 outside_china_deaths NA
2020-02-03 global_deaths 361
2020-02-03 china_deaths 361
2020-02-03 outside_china_deaths NA
2020-02-04 global_deaths 425
2020-02-04 china_deaths 425
2020-02-04 outside_china_deaths NA
2020-02-05 global_deaths 491
2020-02-05 china_deaths 491
2020-02-05 outside_china_deaths NA
2020-02-06 global_deaths 564
2020-02-06 china_deaths 564
2020-02-06 outside_china_deaths NA
2020-02-07 global_deaths 637
2020-02-07 china_deaths 637
2020-02-07 outside_china_deaths NA
2020-02-08 global_deaths 723
2020-02-08 china_deaths 723
2020-02-08 outside_china_deaths NA
2020-02-09 global_deaths 812
2020-02-09 china_deaths 812
2020-02-09 outside_china_deaths NA
2020-02-10 global_deaths 909
2020-02-10 china_deaths 909
2020-02-10 outside_china_deaths NA
2020-02-11 global_deaths 1017
2020-02-11 china_deaths 1017
2020-02-11 outside_china_deaths NA
2020-02-12 global_deaths 1114
2020-02-12 china_deaths 1114
2020-02-12 outside_china_deaths NA
2020-02-13 global_deaths 1368
2020-02-13 china_deaths 1368
2020-02-13 outside_china_deaths NA
2020-02-14 global_deaths 1383
2020-02-14 china_deaths 1381
2020-02-14 outside_china_deaths 2
2020-02-15 global_deaths 1526
2020-02-15 china_deaths 1524
2020-02-15 outside_china_deaths 2
2020-02-16 global_deaths 1669
2020-02-16 china_deaths 1666
2020-02-16 outside_china_deaths 3
2020-02-17 global_deaths 1775
2020-02-17 china_deaths 1772
2020-02-17 outside_china_deaths 3
2020-02-18 global_deaths 1873
2020-02-18 china_deaths 1870
2020-02-18 outside_china_deaths 3
2020-02-19 global_deaths 2009
2020-02-19 china_deaths 2006
2020-02-19 outside_china_deaths 3
2020-02-20 global_deaths 2129
2020-02-20 china_deaths 2121
2020-02-20 outside_china_deaths 8
2020-02-21 global_deaths 2247
2020-02-21 china_deaths 2239
2020-02-21 outside_china_deaths 8
2020-02-22 global_deaths 2359
2020-02-22 china_deaths 2348
2020-02-22 outside_china_deaths 11
2020-02-23 global_deaths 2462
2020-02-23 china_deaths 2445
2020-02-23 outside_china_deaths 17
2020-02-24 global_deaths 2618
2020-02-24 china_deaths 2595
2020-02-24 outside_china_deaths 23
2020-02-25 global_deaths 2700
2020-02-25 china_deaths 2666
2020-02-25 outside_china_deaths 34
2020-02-26 global_deaths 2762
2020-02-26 china_deaths 2718
2020-02-26 outside_china_deaths 44
2020-02-27 global_deaths 2804
2020-02-27 china_deaths 2747
2020-02-27 outside_china_deaths 57
2020-02-28 global_deaths 2858
2020-02-28 china_deaths 2791
2020-02-28 outside_china_deaths 67
2020-02-29 global_deaths 2924
2020-02-29 china_deaths 2838
2020-02-29 outside_china_deaths 86
2020-03-01 global_deaths 2977
2020-03-01 china_deaths 2873
2020-03-01 outside_china_deaths 104
2020-03-02 global_deaths 3043
2020-03-02 china_deaths 2915
2020-03-02 outside_china_deaths 128
2020-03-03 global_deaths 3112
2020-03-03 china_deaths 2946
2020-03-03 outside_china_deaths 166
2020-03-04 global_deaths 3198
2020-03-04 china_deaths 2984
2020-03-04 outside_china_deaths 214
2020-03-05 global_deaths 3281
2020-03-05 china_deaths 3015
2020-03-05 outside_china_deaths 266
2020-03-06 global_deaths 3380
2020-03-06 china_deaths 3045
2020-03-06 outside_china_deaths 335
2020-03-07 global_deaths 3486
2020-03-07 china_deaths 3073
2020-03-07 outside_china_deaths 413
2020-03-08 global_deaths 3584
2020-03-08 china_deaths 3100
2020-03-08 outside_china_deaths 484
2020-03-09 global_deaths 3809
2020-03-09 china_deaths 3123
2020-03-09 outside_china_deaths 686
2020-03-10 global_deaths 4012
2020-03-10 china_deaths 3140
2020-03-10 outside_china_deaths 872
2020-03-11 global_deaths 4292
2020-03-11 china_deaths 3162
2020-03-11 outside_china_deaths 1130
2020-03-12 global_deaths 4613
2020-03-12 china_deaths 3173
2020-03-12 outside_china_deaths 1440
2020-03-13 global_deaths 4955
2020-03-13 china_deaths 3180
2020-03-13 outside_china_deaths 1775
2020-03-14 global_deaths 5392
2020-03-14 china_deaths 3194
2020-03-14 outside_china_deaths 2198
2020-03-15 global_deaths 5735
2020-03-15 china_deaths 3204
2020-03-15 outside_china_deaths 2531
2020-03-16 global_deaths 6606
2020-03-16 china_deaths 3218
2020-03-16 outside_china_deaths 3388
2020-03-17 global_deaths 7426
2020-03-17 china_deaths NA
2020-03-17 outside_china_deaths NA
2020-03-18 global_deaths 7807
2020-03-18 china_deaths NA
2020-03-18 outside_china_deaths NA
2020-03-19 global_deaths 8778
2020-03-19 china_deaths NA
2020-03-19 outside_china_deaths NA
2020-03-20 global_deaths 9840
2020-03-20 china_deaths NA
2020-03-20 outside_china_deaths NA
2020-03-21 global_deaths 11183
2020-03-21 china_deaths NA
2020-03-21 outside_china_deaths NA
2020-03-22 global_deaths 12783
2020-03-22 china_deaths NA
2020-03-22 outside_china_deaths NA
2020-03-23 global_deaths 14509
2020-03-23 china_deaths NA
2020-03-23 outside_china_deaths NA
2020-03-24 global_deaths 16231
2020-03-24 china_deaths NA
2020-03-24 outside_china_deaths NA
2020-03-25 global_deaths 18440
2020-03-25 china_deaths 3287
2020-03-25 outside_china_deaths NA

Quick Data Visualization

covid.confirmed %>% ggplot() + geom_jitter(aes(x = `date`, y = `count`, color = `region`)) + ggtitle('COVID-19 Confirmed Counts, Reported by WHO')

covid.confirmed %>% ggplot() + geom_point(aes(x = `date`, y = `count`, color = `region`)) + ggtitle('COVID-19 Confirmed Counts, Reported by WHO') + facet_grid( ~ `region`)

As the plots show, the reported number of confirmed cases jumps discontinuously between February 16 and February 17: the global count rises from 51,857 to 71,429 in a single day.

From Situation Report 28 on February 17, we see an explanation:

"From today, WHO will be reporting all confirmed cases, including both laboratory-confirmed as previously reported, and those reported as clinically diagnosed (currently only applicable to Hubei province, China). From 13 February through 16 February, we reported only laboratory confirmed cases for Hubei province as mentioned in the situation report published on 13 February. The change in reporting is now shown in the figures. This accounts for the apparent large increase in cases compared to prior situation reports."

It’s important to understand this change in reporting before conducting estimations and projections.
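
One way to catch such a reporting change before fitting anything is to compare each day's increase with the previous day's. The following is a minimal sketch, not part of the original analysis: it uses five global counts copied from the table above, and the threshold of 5 is an arbitrary rule of thumb chosen for illustration.

```r
library(dplyr)

# Global confirmed counts copied from the table above (2020-02-14 to 2020-02-18).
global_counts <- tibble(
  date  = as.Date("2020-02-14") + 0:4,
  count = c(49053, 50580, 51857, 71429, 73332)
)

# Daily increase, and its ratio to the previous day's increase.
jumps <- global_counts %>%
  arrange(date) %>%
  mutate(new_cases = count - lag(count),
         ratio     = new_cases / lag(new_cases))

# Assumed rule of thumb: flag days whose increase exceeds 5x the previous day's.
flagged <- jumps %>% filter(!is.na(ratio), ratio > 5)
flagged
```

On this sample the only flagged date is 2020-02-17, the day the reporting definition changed. The same pipeline could be applied to the full `covid.confirmed` table after filtering to a single region.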

covid.deaths %>% ggplot() + geom_point(aes(x = `date`, y = `count`, color = `region`)) + ggtitle('Coronavirus Death Counts, Reported by WHO')
## Warning: Removed 53 rows containing missing values (geom_point).

There are signs of stabilization in death counts within China, but it is still too early to say so conclusively. Notice that the cumulative number of deaths outside of China has surpassed that within China (3,388 versus 3,218 as of March 16), and as such, the World Health Organization has changed its reporting structure to offer data at a much finer granularity.
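
The crossover date can be read off the table programmatically. This is a small sketch that assumes the same long layout as `covid.deaths` (columns `date`, `region`, `count`); the sample rows are copied from the table above, and the object names are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Three days of death counts copied from the table above.
deaths_sample <- tibble(
  date   = rep(as.Date("2020-03-14") + 0:2, each = 2),
  region = rep(c("china_deaths", "outside_china_deaths"), times = 3),
  count  = c(3194, 2198, 3204, 2531, 3218, 3388)
)

# Put the two regions side by side and find the first day
# on which deaths outside China exceed deaths within China.
crossover <- deaths_sample %>%
  pivot_wider(names_from = region, values_from = count) %>%
  filter(outside_china_deaths > china_deaths) %>%
  summarise(first_date = min(date))

crossover
```

For these rows the first such date is 2020-03-16, the last day on which the reports carry the China / outside-China split.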

Using Global Cross-verified Data

Most developers and researchers are using data gathered and provided by either the team at Johns Hopkins CSSE or cleaned and normalized data from DataHub.

Let’s first take a look at DataHub’s timeseries data.

#from DataHub
json_file <- 'https://datahub.io/core/covid-19/datapackage.json'
json_data <- fromJSON(paste(readLines(json_file), collapse=""))

# get list of all resources:
print(json_data$resources$name)

# print all tabular data (if any exists)
# seq_along() is safe even when there are no resources (1:length(x) would yield c(1, 0))
for (i in seq_along(json_data$resources$datahub$type)) {
  if (json_data$resources$datahub$type[i] == 'derived/csv') {
    path_to_file <- json_data$resources$path[i]
    resource_data <- read.csv(url(path_to_file))
    print(resource_data)
  }
}
setwd("covid-19")
td <- Sys.Date()


covid_combined <- as_tibble(read_csv(file =  "https://datahub.io/core/covid-19/r/time-series-19-covid-combined.csv"))
covid_keycountries_counts <- as_tibble( read_csv("https://datahub.io/core/covid-19/r/key-countries-pivoted.csv") )





# normalize some data (US, UK)
covid_combined$`Country/Region`[which(covid_combined$`Country/Region` == "US")] <- "United States"
covid_keycountries_counts <- covid_keycountries_counts %>% rename(`United States` = US)
covid_keycountries_counts <- covid_keycountries_counts %>% rename(`United Kingdom` = `United_Kingdom`)


covid_keycountries_counts_pivoted <- covid_keycountries_counts %>%
  pivot_longer(cols = - c("Date"),
               names_to = "Regions",
               values_to = "Confirmed Counts")
                
# get death counts of key countries
keycountries <- covid_keycountries_counts_pivoted$Regions %>% unique()
    

# long format: one row per date and country (or per province, where the source splits a country)
covid_keycountries_deaths_pivoted <- covid_combined %>%
  filter(`Country/Region` %in% keycountries) %>%
  select(Date, `Country/Region`, Deaths)

# wide format: one column per country
# note: values_fn = max keeps the largest value when a country has several rows per date;
# use sum instead if national totals are wanted
covid_keycountries_deaths <- covid_keycountries_deaths_pivoted %>%
  pivot_wider(names_from = `Country/Region`,
              values_from = Deaths,
              values_fn = list(Deaths = max))

covid_keycountries_recovered_pivoted <- covid_combined %>%
  filter(`Country/Region` %in% keycountries) %>%
  select(Date, `Country/Region`, Recovered)

covid_keycountries_recovered <- covid_keycountries_recovered_pivoted %>%
  pivot_wider(names_from = `Country/Region`,
              values_from = Recovered,
              values_fn = list(Recovered = max))


mostrecent <- max(covid_keycountries_counts$Date)
ranking_confirmed <- covid_keycountries_counts_pivoted %>% 
  filter(Date == mostrecent) %>% 
  select(c(Regions, `Confirmed Counts`)) %>%
  arrange( desc(`Confirmed Counts`) ) %>%
  pull('Regions')

# reorder columns in decreasing order:
# all_of() makes it explicit that ranking_confirmed is a character vector of column names
covid_keycountries_counts <- covid_keycountries_counts %>% select(Date, all_of(ranking_confirmed))





plotCases <- covid_keycountries_counts_pivoted %>% 
  plot_ly(
    x = ~ Date,
    y = ~ `Confirmed Counts`,
    color = ~ Regions,
    type = 'scatter',  # set explicitly; 'mode' only applies to scatter traces
    mode = 'lines'
  ) %>% layout(title = 'Confirmed Cases of COVID-19 by Region',
    legend = list(x = 0, y = 1, font = list(size = 12), bgcolor = "#F2F2F2"))

plotDeaths <- covid_keycountries_deaths_pivoted %>%
  plot_ly(
    x = ~ Date,
    y = ~ Deaths,
    color = ~ `Country/Region`,
    type = 'scatter',  # set explicitly; 'mode' only applies to scatter traces
    mode = 'lines'
  ) %>% layout(title = 'Deaths from COVID-19 by Region',
    legend = list(x = 0, y = 1, font = list(size = 12), bgcolor = "#F2F2F2"))

plotCases
plotDeaths

Cautions Against Premature Predictions

There has been a lot of speculation regarding the future of the novel coronavirus, and for good reason: financial markets adjust to news of outbreaks, and many people’s lives and international plans are being changed by COVID-19 concerns.

According to this article in The Atlantic, "every country’s numbers are the result of a specific set of testing and accounting regimes… even though these inconsistencies are public and plain, people continue to rely on charts showing different numbers… it encourages dangerous behavior such as cutting back testing to bring a country’s numbers down or slow-walking testing to keep a country’s numbers low."

Special attention needs to be paid to policy shifts such as the recent lockdowns, curfews, and transitions.

Near-Future Direction

Because the total number of cases is determined only by the tests actually performed, we should expect a great deal of noise obscuring the true “signal”. Any model we run should include a “noise” term in the estimation, perhaps by adding a random number of cases at each data point, scaled to the population under consideration (i.e. a number of possible unreported cases expressed as a percentage of the country’s population).
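
A minimal sketch of this idea follows. Everything in it is a placeholder chosen for illustration: the reported counts, the population, and the assumed ceiling on unreported cases (`max_frac`, a share of the population) are not taken from any dataset, and the uniform distribution is just the simplest choice.

```r
set.seed(1)  # reproducible draws

# Hypothetical inputs, for illustration only.
reported   <- c(100, 150, 230, 340, 500)  # reported cumulative cases
population <- 1e6                         # population of the region
max_frac   <- 1e-4                        # assumed ceiling on unreported cases (share of population)

# Add a uniformly distributed number of extra cases to every data point.
add_noise <- function(counts, population, max_frac) {
  extra <- round(runif(length(counts), min = 0, max = max_frac * population))
  counts + extra
}

# Repeat the perturbation to get an ensemble: rows are time points, columns are draws.
ensemble <- replicate(200, add_noise(reported, population, max_frac))

# 5%, 50% and 95% quantiles of the perturbed counts at each time point.
band <- apply(ensemble, 1, quantile, probs = c(0.05, 0.5, 0.95))
band
```

Fitting a model to each column of `ensemble` rather than to `reported` alone would give a spread of parameter estimates instead of a single one. Note that independent noise at each point can make a perturbed cumulative series non-monotone, so a more careful version would perturb the daily increments instead.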

The following is a minor consideration, but paradoxically, if there are more cases than reported, then perhaps some areas are already saturated with infected people. We can add a bias factor to the prediction that “underestimates” the value to account for this.

We must of course add an “improvement factor” accounting for the discontinuity in reported numbers as (presumably)

Model Selection

Because we are only in the early phase of the outbreak on a global scale, we want to consider a few models:

Generalized Logistic Model: \[\frac{d C(t)}{dt} = r C^p (t) \left( 1 - \frac{C(t)}{K}\right)\]

Logistic Growth Model: \[ \frac{d C(t)}{dt} = r C(t) \left( 1 - \frac{C(t)}{K}\right)\]

Generalized Growth Model: \[\frac{d C(t)}{dt} = r C^p (t) \]
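
All three models share one right-hand side: setting K to infinity in the generalized logistic model gives the generalized growth model, and setting p = 1 gives the logistic growth model. Here C(t) is the cumulative case count, r the growth rate, p the parameter that tunes between sub-exponential (p < 1) and exponential (p = 1) growth, and K the final size of the outbreak. The sketch below integrates the equation with a plain forward Euler scheme; the function names and all parameter values are hypothetical and chosen only to show the qualitative shapes.

```r
# Shared right-hand side of the three models:
#   K = Inf -> generalized growth model
#   p = 1   -> logistic growth model
growth_rate <- function(C, r, p = 1, K = Inf) {
  r * C^p * (1 - C / K)
}

# Forward Euler integration; returns C at t = 0, 1, ..., days.
simulate_growth <- function(C0, r, p = 1, K = Inf, days = 60, dt = 0.01) {
  steps_per_day <- round(1 / dt)
  C <- numeric(days + 1)
  C[1] <- C0
  current <- C0
  for (d in seq_len(days)) {
    for (s in seq_len(steps_per_day)) {
      current <- current + dt * growth_rate(current, r, p, K)
    }
    C[d + 1] <- current
  }
  C
}

# Illustrative parameters only, not estimates.
ggm       <- simulate_growth(C0 = 10, r = 0.5, p = 0.8)            # generalized growth
lgm       <- simulate_growth(C0 = 10, r = 0.5, p = 1,   K = 1e4)   # logistic growth
glm_curve <- simulate_growth(C0 = 10, r = 0.5, p = 0.8, K = 1e4)   # generalized logistic

# Compare the three trajectories at a few time points.
round(cbind(day = 0:60, ggm, lgm, glm_curve)[c(1, 11, 31, 61), ])
```

For the same r and p, the generalized logistic curve never exceeds the generalized growth curve, since the factor (1 - C/K) can only reduce the growth rate. A dedicated ODE solver would be the better choice for real fitting; Euler is used here only to keep the sketch dependency-free.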

Assumptions and Remarks:

The generalized growth model lets us tune to the sub-exponential growth of the COVID-19 outbreak in its early phase; however, it fails to describe the decay of the incidence rate. We can use it as a rough upper bound for our estimation, assuming that the outbreak continues to grow at the rate represented by historical data.
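
The generalized growth model can be solved in closed form by separation of variables: for p = 1 the solution is C(t) = C(0) exp(rt), and for 0 <= p < 1 it is C(t) = (r(1 - p)t + C(0)^(1 - p))^(1/(1 - p)), which grows polynomially rather than exponentially. The sketch below evaluates both cases; the function name and parameter values are illustrative assumptions, not fitted quantities.

```r
# Closed-form solution of dC/dt = r * C^p with C(0) = C0, for 0 <= p <= 1.
ggm_solution <- function(t, C0, r, p) {
  if (p == 1) {
    C0 * exp(r * t)                               # exponential growth
  } else {
    (r * (1 - p) * t + C0^(1 - p))^(1 / (1 - p))  # sub-exponential (polynomial) growth
  }
}

t <- 0:10
exponential    <- ggm_solution(t, C0 = 1, r = 0.5, p = 1)
subexponential <- ggm_solution(t, C0 = 1, r = 0.5, p = 0.5)

# Side-by-side comparison of the two regimes.
data.frame(t, exponential, subexponential)
```

With these values the p = 0.5 curve reduces to (0.25 t + 1)^2, reaching 12.25 at t = 10, while the exponential curve reaches exp(5), roughly 148. Neither curve ever levels off, which is why the model is read as an upper bound.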

Because the Logistic Growth Model and Generalized Logistic Model assume a logistic decay of the growth rate along the time horizon, we should expect these Logistic models to provide a lower bound estimate for future counts. The Generalized Logistic Model allows for an early sub-exponential growth and is better suited to explain asymmetries in the growth and decay (and may better capture government policy efforts with isolation, quarantine and lockdown).